Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log compaction failure error and delete temporarily blocks from disk #2261

Merged
merged 3 commits into from
Jun 29, 2022

Conversation

pracucci
Copy link
Collaborator

What this PR does

I've seen a compactor failing to compact some blocks, but the reason is missing from logs because we're not logging the actual error (the error returned by runCompactionJob() is never logged).

Once the compaction fails, source blocks are left on disk for later investigation. While this is a nice thing, it triggers another issue: subsequent job executions (other jobs) may run out of disk space, because previous failing compaction run jobs (failed) are left on disk. This is an issue which is also happening on the cluster I'm investigating.

In this PR I propose to fix both.

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci requested a review from pstibrany June 28, 2022 13:16
pracucci added 2 commits June 28, 2022 15:17
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@replay
Copy link
Contributor

replay commented Jun 28, 2022

I agree that we shouldn't just keep those source blocks of failed compactions forever, but if we now make it impossible to analyze the source blocks of failed compactions because they always get deleted isn't it possible that in the future we'll end up in a situation where we'll miss having them because some compaction failed and we don't know why?

Would it make sense to add a runtime config flag which can make the compactor optionally not delete the source blocks of failed compactions of a specific tenant? That way if we do need to investigate some failing compactions we could just enable that flag to make the compactor not delete the source blocks of failed compactions of this tenant.

I'm not sure anymore, but I thought in the past we have encountered situations where the ability to analyze the source blocks of failed compactions has been useful to understand an issue, or am I misremembering that?

@pstibrany
Copy link
Member

I'm not sure anymore, but I thought in the past we have encountered situations where the ability to analyze the source blocks of failed compactions has been useful to understand an issue, or am I misremembering that?

Source blocks are still available in the bucket.

@pracucci
Copy link
Collaborator Author

pracucci commented Jun 28, 2022

I'm not sure anymore, but I thought in the past we have encountered situations where the ability to analyze the source blocks of failed compactions has been useful to understand an issue, or am I misremembering that?

Source blocks are still available in the bucket.

As @pstibrany mentioned, you can download source blocks from object storage anytime. I think having blocks in the container disk is not much useful anyway. To debug them you will have to download them into a workstation where you have all your debugging tooling, so why not downloading them directly from the object storage (which is also faster to download).

Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice catch with unlogged error.

Copy link
Contributor

@replay replay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@pracucci pracucci merged commit e9bba6b into main Jun 29, 2022
@pracucci pracucci deleted the fix-compactor-on-error branch June 29, 2022 05:29
masonmei pushed a commit to udmire/mimir that referenced this pull request Jul 11, 2022
…rafana#2261)

* Log compaction failure error and delete temporarily blocks from disk

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Well, we have to always delete local dir

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fix unit tests

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants